Project 1 - Part 1 - Goals Scored at Each World Cup
This analysis uses the TidyTuesday dataset “World Cup”, which fascinated me as a soccer enthusiast. The data cover results, locations, matches played, and goals scored from 1930 to 2018, the most recent World Cup when the dataset was created.
I chose to visualize the number of goals scored at each tournament dating back to 1930.
Project 1 - Part 2 - Chocolate Ratings by Percent Cocoa
In this analysis, I chose another dataset that intrigued me, titled “Chocolate Ratings”. I visualized the data to find which cocoa percentage earns the highest ratings for a chocolate bar.
Visualization
Here is the piped data that I used. I had to convert the cocoa percentages from text to numeric values so that the data would sort in numeric order.
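A minimal sketch of that conversion step, not the original pipeline. The toy data and the column names `cocoa_percent` and `rating` are assumptions made for illustration. As text, "100%" sorts before "70%"; once the percent sign is stripped and the value is numeric, the order is correct.

```r
library(dplyr)

# Hypothetical toy data; column names are assumptions, not taken from this page
chocolate <- tibble(
  cocoa_percent = c("70%", "85%", "100%", "70%", "85%"),
  rating        = c(3.50, 3.00, 2.00, 3.25, 2.75)
)

ratings_by_cocoa <- chocolate %>%
  # Strip the literal "%" and convert the remaining text to a number
  mutate(cocoa_percent = as.numeric(sub("%", "", cocoa_percent, fixed = TRUE))) %>%
  group_by(cocoa_percent) %>%
  summarise(mean_rating = mean(rating), .groups = "drop") %>%
  # Numeric sort: 70, 85, 100 (a character sort would put "100%" first)
  arrange(cocoa_percent)

print(ratings_by_cocoa)
```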
In this project, I worked with data on Netflix titles and made three visualizations. The purpose was to use piping to organize the data, with a focus on regular expressions, so that the results are easier to understand and present.
Visualization 1
First, I wanted to find the most common words in the titles of Netflix movies and TV shows, excluding filler words like “the” and “and”. This is the organized dataset that I used for my visualization.
# A tibble: 10 × 2
# Groups: words [10]
words n
<chr> <int>
1 love 152
2 my 127
3 you 81
4 man 79
5 christmas 78
6 world 69
7 story 67
8 life 66
9 movie 60
10 little 58
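A table like the one above could be produced roughly as follows. This is a sketch under assumptions: the toy titles, the column name `title`, and the short `filler_words` list are all hypothetical, and the original analysis may have split and filtered the words differently.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical toy titles; the real data has thousands of rows
titles <- tibble(
  title = c("Love Actually", "The Love Story", "My Life and You", "A Story of Love")
)

# Assumed list of filler words to exclude
filler_words <- c("the", "and", "a", "of")

word_counts <- titles %>%
  # Lower-case each title, then pull out every run of letters as one word
  mutate(words = str_extract_all(str_to_lower(title), "[a-z']+")) %>%
  # One row per word instead of one row per title
  unnest(words) %>%
  filter(!words %in% filler_words) %>%
  # Count each word and sort from most to least common
  count(words, sort = TRUE)

print(word_counts)
```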
Visualization 2
Second, I compared the number of movies and TV shows on Netflix by release year.
# A tibble: 118 × 3
# Groups: release_year [73]
release_year type count
<dbl> <chr> <int>
1 1925 TV Show 1
2 1942 Movie 2
3 1943 Movie 3
4 1944 Movie 3
5 1945 Movie 3
6 1946 Movie 1
7 1946 TV Show 1
8 1947 Movie 1
9 1954 Movie 2
10 1955 Movie 3
# ℹ 108 more rows
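The counting step behind this table can be sketched as below. The toy rows are hypothetical. The column names `release_year`, `type`, and `count` match the printed output above, but the exact pipeline is an assumption.

```r
library(dplyr)

# Hypothetical toy rows standing in for the Netflix titles data
netflix <- tibble(
  release_year = c(2019, 2019, 2019, 2020, 2020, 2020),
  type         = c("Movie", "Movie", "TV Show", "Movie", "TV Show", "TV Show")
)

titles_per_year <- netflix %>%
  # One row per (year, type) pair, with the number of titles in `count`
  count(release_year, type, name = "count")

print(titles_per_year)
```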
Visualization 3
Finally, I found the percentage of titles that contain a digit, that start with “the”, or that contain the word “the” anywhere, and showed these percentages on a bar graph. This turned out to be far more difficult than I expected. I found it hard to give the axis labels names different from the variable names, which were sometimes confusing on their own.
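One way to approach both parts is sketched below, with hypothetical titles and variable names. The regular expressions are assumptions about how each pattern was defined. The axis labels are set apart from the variable names by passing a named vector to `scale_x_discrete(labels = ...)`.

```r
library(dplyr)
library(stringr)
library(tidyr)
library(ggplot2)

# Hypothetical toy titles
netflix <- tibble(
  title = c("The Crown", "13 Reasons Why", "Into the Night", "Stranger Things")
)

pattern_pct <- netflix %>%
  summarise(
    # mean() of a logical vector is the share of TRUE values
    has_digit    = mean(str_detect(title, "\\d")) * 100,
    starts_the   = mean(str_detect(title, regex("^the\\b", ignore_case = TRUE))) * 100,
    contains_the = mean(str_detect(title, regex("\\bthe\\b", ignore_case = TRUE))) * 100
  ) %>%
  # Reshape to one row per pattern for plotting
  pivot_longer(everything(), names_to = "pattern", values_to = "percent")

# Readable axis labels, keyed by the variable names used in the data
pattern_labels <- c(
  has_digit    = "Contains a digit",
  starts_the   = "Starts with \"the\"",
  contains_the = "Contains \"the\""
)

pattern_plot <- ggplot(pattern_pct, aes(x = pattern, y = percent)) +
  geom_col() +
  scale_x_discrete(labels = pattern_labels) +
  labs(x = NULL, y = "Percent of titles")
```

Keeping short variable names in the data and mapping them to display labels only at plot time means the pipeline stays easy to type while the graph stays easy to read.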
Project 3 - Generational Marijuana Use
This project used a permutation test, which simulates the null hypothesis of no relationship, to check whether two variables are related. In my analysis, I tested the relationship between parental and child use of marijuana. This was my favorite project. I really enjoyed running the statistical analysis and presenting the results.
Visualization
I first found the difference in the proportion of students who use marijuana between those whose parents used it and those whose parents did not. The proportion is 19.5 percentage points higher among students whose parents used it.
# A tibble: 445 × 2
student parents
<fct> <fct>
1 uses used
2 uses used
3 uses used
4 uses used
5 uses used
6 uses used
7 uses used
8 uses used
9 uses used
10 uses used
# ℹ 435 more rows
[1] 0.1952381
Then I ran a permutation test with 1000 permutations to find how likely a difference this large would be without a relationship, i.e. assuming the null hypothesis of no relationship. On this graph you can see the results of the analysis, where the red line represents the difference in proportions in the actual data.
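Using these results, I found a p-value of 0: none of the 1000 permuted differences was as large as the observed one, so the p-value is best reported as less than 0.001 rather than exactly zero.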
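The whole procedure can be sketched in base R as follows. The cell counts are assumptions, not taken from this page; they were picked so that the data has 445 rows and reproduces the 0.1952381 difference shown above. The seed and the one-sided comparison are also choices made for this sketch.

```r
set.seed(47)  # arbitrary seed so the sketch is reproducible

# Assumed counts: 210 students whose parents used (125 use, 85 do not)
# and 235 whose parents did not (94 use, 141 do not); 445 rows in total
survey <- data.frame(
  parents = rep(c("used", "not"), times = c(210, 235)),
  student = c(rep(c("uses", "not"), times = c(125, 85)),
              rep(c("uses", "not"), times = c(94, 141)))
)

# Difference in the proportion of students who use, by parental use
diff_in_props <- function(student, parents) {
  mean(student[parents == "used"] == "uses") -
    mean(student[parents == "not"] == "uses")
}

observed <- diff_in_props(survey$student, survey$parents)

# Shuffle the parents column 1000 times; shuffling breaks any real link
# between the two variables, which is what the null hypothesis assumes
null_diffs <- replicate(
  1000,
  diff_in_props(survey$student, sample(survey$parents))
)

# One-sided p-value: share of shuffled differences at least as large as observed
p_value <- mean(null_diffs >= observed)

print(observed)
print(p_value)
```

With an observed difference this far out in the tail, the estimated p-value will usually come out as 0 or very close to it, which is why the count of permutations sets the smallest p-value the test can report.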
Project 4 - SQL Analysis of WAI Data for Auditory Research
In this project, I aimed to recreate a graph that shows the relationship between the mean absorbance of sound and the frequency at which it is played. I then made my own graph, comparing mean absorbance across frequencies for subjects recorded as male, female, or of unknown sex in the study conducted by Abur in 2014.
SELECT Measurements.Identifier,
       COUNT(DISTINCT CONCAT(Measurements.SubjectNumber, Measurements.Ear)) AS Unique_Ears,
       PI_Info.AuthorsShortList,
       Measurements.Instrument,
       Measurements.Frequency,
       AVG(Measurements.Absorbance) AS MeanAbsorbance,
       CONCAT(PI_Info.AuthorsShortList, ' et al. N=', COUNT(DISTINCT CONCAT(Measurements.SubjectNumber, Measurements.Ear)), ', ', Measurements.Instrument) AS LegendLabel
FROM Measurements
JOIN PI_Info ON Measurements.Identifier = PI_Info.Identifier
WHERE Measurements.Identifier IN ('Abur_2014', 'Feeney_207', 'Groon_2015', 'Lewis_2015', 'Liu_2008', 'Rosowski_2012', 'Shahnaz_2006', 'Shaver_2013', 'Sun_2016', 'Voss_1994', 'Voss_2010', 'Werner_2010')
  AND Measurements.Frequency >= 200
GROUP BY Measurements.Identifier, Measurements.Instrument, PI_Info.AuthorsShortList, Measurements.Frequency;
Visualization 2
In my second visualization, I compare mean absorbance within the study conducted by Abur in 2014. I differentiate between men, women, and subjects whose sex was unknown.